Report for the Applied Data Science Capstone project

by Andreas Johannes

Salvation of a Rose salesman

1 Background
2 Data sources and treatment
3 Methodology

4 Results
5 Discussion
6 Conclusion

1 Background

back to top

Be this guy!

Sell your Roses here (or rather there)!

Parisian Rose seller, this could be you! Whether you are selling Roses to couples or playing your Fiddle for tips, you want to know where the most restaurants and bars are, because that's were the most money can be made. Read on for an depth analysis of where to go tonight to ply your trade.

PLUS if you know you made money in one area, use our similarity rating to find similar areas for your next nights work!

make the machiene work for you

Summary:

To find the best areas to sell Roses on the street:

  • Grade areas in Paris according to how many restaurants and bars there are in them
    • Show this data on a map of Paris
    • By restaurant category/type
  • Find locations which offer similar night life options
    • generally categorize areas
    • given a starting address, find similar areas
      Probably this analysis may be useful outside of the rose-selling market, but that's a future venture.

2 Data sources and treatment

back to top

The heat-map

  • We will segment Paris into evenly sized tiles
  • Use Foursquare to obtain a count for the restaurants and bars in each tile.
  • Categorize Restaurants and bars in 4-8 categories (eg. Bar, club, Fast Food etc.)
  • Use folium to plot heat-map tiles onto a map of Paris for each category
  • Sort by number of found places to suggest best areas.

Similar areas

Use above categories to find areas that are similar:

  • Inspect distribution to see how many area categories are sensible
  • use k-means to group this number of categories
  • Plot to map of Paris
  • given a location, use generalized distance across features (as used in k-means algorithm) to produce a sorted list of areas similar to the current location.

3 Methodology

back to top

In this section we will execute the strategy outlined in the previous section.

3.1 Paris map

back to top

In [1]:
import numpy as np
import pandas as pd
import folium
import pygeoj
import matplotlib.pyplot as plt
import seaborn as sns

Create a regular hexagonal grid around Paris. We wil use cube coordinates centered around the center of paris accordintg to wiki: paris. The tiles will be spaced 200 m appart and we will have 50 tiles in each direction. This covers the center of Paris quite well and should have sufficient resolution. See [https://www.redblobgames.com/grids/hexagons/] for an introduction to hexagonal coordinates.

In [2]:
# get a 3D grid from with 2*tile_count + 1 number of tiles across
tile_count = tc = 30
p_range, q_range, r_range = range(-tc,tc+1),range(-tc,tc+1),range(-tc,tc+1)
r_i, q_i, p_i = np.meshgrid(p_range, q_range, r_range)
pqr_i = np.stack([p_i.flat, q_i.flat, r_i.flat])
# reduce grid to include only the indexes on our hexagonal plane
hex_mask = pqr_i.sum(axis=0)==0
hex_mask
#xyz_hex = np.empty(shape=(3,hex_mask.sum()),dtype=np.int32)
pqr_hex = pqr_i[:,hex_mask]
pqr_hex.dtype, pqr_hex.T.shape
Out[2]:
(dtype('int32'), (2791, 3))

We have an index grid, not to convert it into geospacial coordinates. We want the spacing to be tile_size, and first need to convert that to angular distances. We will only cover a small segment of the sperical earth and use the apropriate simplifiations. see wiki: geographic coordinates

In [3]:
tile_size = ts = 200. # m
earth_radius =  6367449
center_of_paris = (48.8567, 2.3508)
# in angle per meter
lon_conversion = 180./(np.pi*earth_radius)
lat_conversion = 180/(np.pi*earth_radius)*np.cos(np.pi/180.0*center_of_paris[0])

# defining vectors to get form the center of the hex to corner points in angles
ns = 0.5*ts*lon_conversion
ew = 0.5*ts*lat_conversion
s60 = np.sin(60./180.*np.pi)
c60 = np.cos(60./180.*np.pi)
x_step = (ns, 0)
y_step = (-ns*c60, ew*s60)
z_step = (-ns*c60, -ew*s60)
step_vector = np.asarray((x_step, y_step, z_step)).T

def get_corners(step_vector, center):
    '''
    returns the list of coordinates for the corners of a hexagon defined by
    the hexagonal step vector and a center point
    ''' 
    coordinates = []
    perms = [[1,0,0],
             [0,0,-1],
             [0,1,0],
             [-1,0,0],
             [0,0,1],
             [0,-1,0]]
             
    for perm in perms:
        coordinates.append(list(center + np.dot(step_vector,perm)))
    return coordinates

We have all we need to create the hexagonal grid mapped over Paris.

In [4]:
# usefull library to create geojson files
# https://github.com/karimbahgat/PyGeoj
# creating regular tiles around city center
json_tiles = pygeoj.new()
json_tiles_fname = "tiles.geojson"
coords_str_list = []
center_list = []
p_list = []
q_list = []
r_list = []
for coords in pqr_hex.T:
    # create a geojson file
    coords_str=('_').join([str(x) for x in coords])
    coords_str_list.append(coords_str)
    p_list.append(coords[0])
    q_list.append(coords[1])
    r_list.append(coords[2])
    
    center = center_of_paris[::-1] + np.dot(step_vector, coords)
    center_list.append(center)
    coordinates = get_corners(step_vector, center)
    json_tiles.add_feature(
        properties={"coords_str":coords_str},
        geometry={"type":"Polygon", "coordinates":[coordinates]})

json_tiles.add_all_bboxes()
json_tiles.update_bbox()
json_tiles.add_unique_id()
json_tiles.save(json_tiles_fname)
center_list[0], center_of_paris
Out[4]:
(array([ 2.39129204, 48.84131851]), (48.8567, 2.3508))
In [5]:
# create a corresponding dataframe:
center_array = np.asarray(center_list)
df_tiles = pd.DataFrame({'coords_str':coords_str_list, 
                         'lat':center_array[:,1],
                         'lon':center_array[:,0]})
latdist_array = (np.asarray(df_tiles.lat)-center_of_paris[0])/lat_conversion
londist_array = (np.asarray(df_tiles.lon)-center_of_paris[1])/lon_conversion
df_tiles['distance_to_center'] = np.asarray(np.sqrt(latdist_array**2 + londist_array**2),
                                            dtype=np.int32)
df_tiles['p'] = p_list
df_tiles['q'] = q_list
df_tiles['r'] = r_list
no_tiles = df_tiles.shape[0]
In [6]:
map_paris = folium.Map(location=center_of_paris, zoom_start=12)
test_df = pd.DataFrame({'Paris':1}, columns=['City','Value'])
# Add the color for the chloropleth:
folium.Choropleth(
    geo_data=json_tiles_fname,
    name='choropleth',
    data=df_tiles,
    fill_color='Blues',
    columns=['coords_str', 'distance_to_center'],
    key_on='feature.properties.coords_str',
    fill_opacity=0.5, 
    line_opacity=0.1,
    legend_name='Distance to Center',   
).add_to(map_paris)


map_paris
Out[6]:
In [7]:
df_tiles
Out[7]:
coords_str lat lon distance_to_center p q r
0 30_-30_0 48.841319 2.391292 5196 30 -30 0
1 29_-30_1 48.840806 2.389942 5111 29 -30 1
2 28_-30_2 48.840293 2.388593 5031 28 -30 2
3 27_-30_3 48.839780 2.387243 4956 27 -30 3
4 26_-30_4 48.839268 2.385893 4886 26 -30 4
... ... ... ... ... ... ... ...
2786 -26_30_-4 48.874132 2.315707 4886 -26 30 -4
2787 -27_30_-3 48.873620 2.314357 4956 -27 30 -3
2788 -28_30_-2 48.873107 2.313007 5031 -28 30 -2
2789 -29_30_-1 48.872594 2.311658 5111 -29 30 -1
2790 -30_30_0 48.872081 2.310308 5196 -30 30 0

2791 rows × 7 columns

3.2 Foursquare data

back to top

Noe that we have the grid on which we want to check for locations, lets use foursquare to find them. We will immedeately collect different restaurant types seperately for later. see foursquare:categories

In [8]:
import requests

# not sharing foursquare credentials
with open('../../foursquare_credentials.dat','r') as f:
    client_id, client_secret = f.readlines()
client_id = client_id[:-1]
version = '20180724'

Manual selection of some categories:

In [9]:
food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues
nightlife_category = '4d4b7105d754a06376d81259'# 'Root' category for all nightlife venues
# other categories:
categories_dict = {'other':['503288ae91d4c4b30a586d67',
                                 '4bf58dd8d48988d1c8941735',
                                 '4bf58dd8d48988d14e941735',
                                 '4bf58dd8d48988d169941735',
                                 '52e81612bcbc57f1066b7a01',
                                 '4bf58dd8d48988d1df931735',
                                 '52e81612bcbc57f1066b79f4',
                                 '4bf58dd8d48988d17a941735',
                                 '4bf58dd8d48988d144941735',
                                 '4bf58dd8d48988d108941735',
                                 '4bf58dd8d48988d120951735',
                                 '4bf58dd8d48988d1be941735',
                                 '4bf58dd8d48988d1c1941735',
                                 '56aa371be4b08b9a8d573508',
                                 '4bf58dd8d48988d1c4941735',
                                 '4bf58dd8d48988d1ce941735',
                                 '4bf58dd8d48988d1cc941735',
                                 '4bf58dd8d48988d1dc931735',
                                 '56aa371be4b08b9a8d573538'],
                        'sweet':['4bf58dd8d48988d146941735',
                                 '52e81612bcbc57f1066b79f2',
                                 '4bf58dd8d48988d1d0941735',
                                 '4bf58dd8d48988d148941735'],
                        'european':['52f2ae52bcbc57f1066b8b81',
                                    '5293a7d53cf9994f4e043a45',
                                    '4bf58dd8d48988d147941735',
                                    '5744ccdfe4b0c0459246b4d0',
                                    '4bf58dd8d48988d109941735',
                                    '52e81612bcbc57f1066b7a05',
                                    '52e81612bcbc57f1066b7a09',
                                    '4bf58dd8d48988d10c941735',
                                    '52e81612bcbc57f1066b79fa',
                                    '4bf58dd8d48988d110941735',
                                    '52e81612bcbc57f1066b79fd',
                                    '4bf58dd8d48988d1c0941735',
                                    '52e81612bcbc57f1066b79f9',
                                    '4bf58dd8d48988d1c2941735',
                                    '52e81612bcbc57f1066b7a04',
                                    '4def73e84765ae376e57713a',
                                    '5293a7563cf9994f4e043a44',
                                    '4bf58dd8d48988d1c6941735',
                                    '5744ccdde4b0c0459246b4a3',
                                    '56aa371be4b08b9a8d57355a',
                                    '4bf58dd8d48988d150941735',
                                    '4bf58dd8d48988d158941735',
                                    '4f04af1f2fb6e1c99f3db0bb',
                                    '52e928d0bcbc57f1066b7e96'],
                        'asian':['4bf58dd8d48988d142941735',
                                 '4bf58dd8d48988d10f941735',
                                 '4bf58dd8d48988d115941735',
                                 '52e81612bcbc57f1066b79f8',
                                 '5413605de4b0ae91d18581a9'],
                        'fast':['4bf58dd8d48988d179941735',
                                '4bf58dd8d48988d16a941735',
                                '52e81612bcbc57f1066b7a02',
                                '52e81612bcbc57f1066b79f1',
                                '4bf58dd8d48988d143941735',
                                '52e81612bcbc57f1066b7a0c',
                                '4bf58dd8d48988d16c941735',
                                '4bf58dd8d48988d128941735',
                                '4bf58dd8d48988d16d941735',
                                '4bf58dd8d48988d1e0931735',
                                '52e81612bcbc57f1066b7a00',
                                '4bf58dd8d48988d10b941735',
                                '4bf58dd8d48988d16e941735',
                                '4edd64a0c7ddd24ca188df1a',
                                '56aa371be4b08b9a8d57350b',
                                '4bf58dd8d48988d1cb941735',
                                '4d4ae6fc7a7b7dea34424761',
                                '5283c7b4e4b094cb91ec88d7',
                                '4bf58dd8d48988d1ca941735',
                                '4bf58dd8d48988d1c5941735',
                                '4bf58dd8d48988d1bd941735',
                                '4bf58dd8d48988d1c7941735',
                                '4bf58dd8d48988d1dd931735'],
                   'night_life':['52e81612bcbc57f1066b7a06',
                                 nightlife_category]}
# fix feature order:
feature_list = list(categories_dict.keys())
feature_list.sort()
feature_list
Out[9]:
['asian', 'european', 'fast', 'night_life', 'other', 'sweet']

Unfortunately, some of these are parent categories which will not be correctly counted if we dont poll all childern. Therefore, we need to delve a little deeper intho the foursquare category system. As example the nightlife category has some subcategories.

In [10]:
get_categories_url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
        client_id, client_secret, version)
all_foursquare_categories = requests.get(get_categories_url).json()['response']['categories']
In [11]:
def get_category_by_id(parent, category_id, result=None):
    if result == None:
        if type(parent) == list:    
            for parent_category in parent:
                result = get_category_by_id(parent_category, category_id, result)
        elif type(parent) == dict:
            if parent['id'] == category_id:
                return parent
            elif len(parent['categories'])!=0:
                for item in parent['categories']:
                    result = get_category_by_id(item, category_id, result)
            else:
                result = None
        return result
    else:
        return result
    
nightlife_categories = get_category_by_id(all_foursquare_categories, nightlife_category)

def get_descendant_categories(parent, categories=[], verbose=False):
    if type(parent) == list:    
        for parent_category in parent:
            categories = get_descendant_categories(parent_category, categories, verbose)
        return categories
    
    elif type(parent) == dict:
        if verbose:
            print(parent['name'], len(categories)+1)
        categories.append(parent['id'])
        if len(parent['categories'])==0:
            return categories
        else:
            for item in parent['categories']:
                categories = get_descendant_categories(item, categories, verbose)
        return categories
    
nl = get_descendant_categories(nightlife_categories, [], verbose=True)
len(nl)
Nightlife Spot 1
Bar 2
Beach Bar 3
Beer Bar 4
Beer Garden 5
Champagne Bar 6
Cocktail Bar 7
Dive Bar 8
Gay Bar 9
Hookah Bar 10
Hotel Bar 11
Karaoke Bar 12
Pub 13
Sake Bar 14
Speakeasy 15
Sports Bar 16
Tiki Bar 17
Whisky Bar 18
Wine Bar 19
Brewery 20
Lounge 21
Night Market 22
Nightclub 23
Other Nightlife 24
Strip Club 25
Out[11]:
25

Now we can iterate over the above manually created categories dict to get a a comprehensive set of all related categories' id's.

In [12]:
catsets_dict = {}
for key, categories in categories_dict.items():
    key_list = []
    for cat_id in categories:
        parent_cat = get_category_by_id(all_foursquare_categories, cat_id)
        key_list += get_descendant_categories(parent=parent_cat, categories=key_list, verbose=False)
    catsets_dict.update({key:set(key_list)})
for key, val in catsets_dict.items():
    print("category {} has {} id's".format(key, len(val)))
category other has 49 id's
category sweet has 9 id's
category european has 96 id's
category asian has 129 id's
category fast has 23 id's
category night_life has 26 id's

We define the functions that will GET the Foursquare data for each area around Paris and filter the categories. We choose a radius of 250 m which causes a little overlap. Double-counting a few venues should not critically affect the this analysis.

In [13]:
def get_categories(categories):
    return [cat['id'] for cat in categories]

def count_categories(catsets_dict, found_categories):
    result = dict([[x,0] for x in catsets_dict.keys()])
    for key, categories_list in catsets_dict.items():
        for found_id_list in found_categories:
            for found_id in found_id_list:
                if found_id in categories_list:
                    result[key] += 1
    return result
    
def get_venues_near_location(lat, lon, client_id, client_secret, radius=250, limit=100):
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, radius, limit)
    response_json = requests.get(url).json()['response']
    key_list = response_json.keys()
    try:
        item_list = response_json['groups'][0]['items']
        venue_categories = [get_categories(item['venue']['categories']) for item in item_list]
    except KeyError:
        venue_categories = []
    return venue_categories, response_json

Unfortunately, the open account at foursquare only grants 950 regular calls to its server per day. Since we have 2791 tiles in our grid, we will need to retrieve and save the data to disc on several days.

In [14]:
try:
    # read file if it already exists
    df_tiles=pd.read_csv('tiles.csv')
    unnamed_cols = [c for c in df_tiles.columns if str(c).find('named:')>0]
    df_tiles.drop(unnamed_cols,axis=1,inplace=True)
    print('data read')
except:
    # else continue the file is created later
    for key in feature_list:
        df_tiles[key] = np.zeros(no_tiles)
    df_tiles['dl_done'] = np.zeros(no_tiles)
    print('no data read')
    
data read

Now we need to iterate the data getter function over all the remaining tiles plotted on the map. Those are the ones were the dl_done column entry is 0.

In [15]:
done_mask = df_tiles['dl_done'].to_numpy()
no_remaining_tiles = no_tiles - done_mask.sum()
index_coord = np.stack(zip(range(no_tiles),center_array))
to_do = max(700,no_remaining_tiles)
remaining_coords = index_coord[np.where(done_mask==0)]
if len(remaining_coords)==0:
    print('all data already downloaded')
all data already downloaded
c:\venv\dev\lib\site-packages\IPython\core\interactiveshell.py:3249: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
  if (await self.run_code(code, result,  async_=asy)):
In [16]:
counts_array = np.zeros((no_tiles, len(feature_list)+1))
for i, key in enumerate(feature_list+['dl_done']):
    counts_array[:,i] = df_tiles[key]
    
for i, coord in remaining_coords[:to_do]:
    print('{}'.format(i))
    foursquare_result = get_venues_near_location(lat=coord[1],
                                                 lon=coord[0],
                                                 client_id=client_id, 
                                                 client_secret=client_secret, 
                                                 radius=250, limit=100)
    
    found_categories = foursquare_result[0]
    counts = count_categories(catsets_dict=catsets_dict, found_categories=found_categories)        
    for j, key in enumerate(feature_list):
        counts_array[i,j] = counts[key]
        counts_array[i,-1] = 1
In [ ]:
 

Lets update our pandas dataframe, and save it to disk, so we dont need to recall foursquare unneccessary:

In [17]:
for j, key in enumerate(feature_list+['dl_done']):
    df_tiles[key] = counts_array[:,j]
df_tiles['all']=counts_array[:,:].sum(axis=1)
unnamed_cols = [c for c in df_tiles.columns if str(c).find('named:')>0]
df_tiles.drop(unnamed_cols,axis=1,inplace=True)
df_tiles.to_csv('tiles.csv', index=True, header=True)
In [18]:
df_tiles=pd.read_csv('tiles.csv')
unnamed_cols = [c for c in df_tiles.columns if str(c).find('named:')>0]
df_tiles.drop(unnamed_cols,axis=1,inplace=True)
In [19]:
df_tiles.head()
Out[19]:
coords_str lat lon distance_to_center p q r dl_done asian european fast night_life other sweet all
0 30_-30_0 48.841319 2.391292 5196 30 -30 0 1.0 3.0 2.0 0.0 1.0 0.0 1.0 8.0
1 29_-30_1 48.840806 2.389942 5111 29 -30 1 1.0 4.0 2.0 1.0 2.0 0.0 1.0 11.0
2 28_-30_2 48.840293 2.388593 5031 28 -30 2 1.0 3.0 2.0 1.0 1.0 0.0 1.0 9.0
3 27_-30_3 48.839780 2.387243 4956 27 -30 3 1.0 1.0 1.0 1.0 0.0 0.0 0.0 4.0
4 26_-30_4 48.839268 2.385893 4886 26 -30 4 1.0 1.0 1.0 1.0 0.0 0.0 0.0 4.0

Lets define an order of features and associated color code:

In [20]:
feature_list = ['european',
                'fast',
                'night_life',
                'asian',
                'other',
                'sweet',
                'all']
color_list = ['Blues',
              'Reds',
              'Greens',
              'RdPu',
              'Purples',
              'Oranges',
              'Greys']

Lets see how many restaurants we tend to find:

In [21]:
for feature in feature_list:
    print('max of {} is {}'.format(feature, df_tiles[feature].max()))
max of european is 27.0
max of fast is 21.0
max of night_life is 25.0
max of asian is 30.0
max of other is 10.0
max of sweet is 14.0
max of all is 72.0

This looks very reasonable. We get sufficiently large counts, but not too large indicating that the spatial and topical cathegories are a well proportioned.

In [22]:
df_features = df_tiles[feature_list]
df_features.head()
Out[22]:
european fast night_life asian other sweet all
0 2.0 0.0 1.0 3.0 0.0 1.0 8.0
1 2.0 1.0 2.0 4.0 0.0 1.0 11.0
2 2.0 1.0 1.0 3.0 0.0 1.0 9.0
3 1.0 1.0 0.0 1.0 0.0 0.0 4.0
4 1.0 1.0 0.0 1.0 0.0 0.0 4.0
In [23]:
f, ax = plt.subplots(figsize=(12, 8))
sns.boxplot(data=df_features, palette=['blue','red','green','magenta','navy','orange','grey'])
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x20912b29508>

We see we have some areas with very high numbers and diversity of restaurants. This follows from the fact that the sum category 'all' is much greater than each individual category. However, all categories have some areas where they have large counts and esp. 'european', 'fast', 'asian' and 'night_life' have a non-negligible median value. All of these factors point to a promising dataset for further evaluation.

3.3 Heat map

back to top

The next step will be to plot the heat maps for all restaurant and bar categories around Paris!

In [24]:
map_paris = folium.Map(location=center_of_paris, zoom_start=13)
# Add the color for the chloropleth:

for name, color in zip(feature_list, color_list):
    folium.Choropleth(
        geo_data=json_tiles_fname,
        name=name,
        data=df_tiles,
        columns=['coords_str', name],
        key_on='feature.properties.coords_str',
        fill_color=color,
        fill_opacity=0.4, 
        line_opacity=0.0,
        legend_name='Number of {} venues'.format(name),   
    ).add_to(map_paris)
folium.LayerControl().add_to(map_paris)

map_paris
Out[24]:

As expected we have a non-uniform distribution of the different categories. We can see darker areas where the total number of venues is high and some bland areas, esp. in the outskirts where there is not much going on. If you are running the notebook you can play with different color overlays. There are slightly different hues representing changing ratios of types of venues, but this is not the best representation of the changing mix of venues. To make variance in regions more clear we will use the k-means algorithm next to group (and colour) similar areas together.

3.4 k-means

back to top

We will next try to group different areas around Paris by similarity according to the venue makeup. Areas with a similar mix of restaurants or bars will probably attract a similar kind of patron. Our clients can thus group their sales according to the type of area adn find out whether there is a particular area makup that they are more sucessful in. Also, maybe there is a very similar mix somehwere else, that they don't yet cater for!

The k-means algorithm works best with standardised data, therefore we create a standardised dataframe. We (somewhat arbitrarily for now) choose k = 5. Please note that if you run this notebook, the k-means algorithm will converge in a different order each time so that we will only be able to reference the specific areas when we find a stable way to identify them.

In [25]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

no_clusters = 6
category_color_list = ['red', 'blue', 'green', 'magenta', 'orange', 'yellow']
features_array = np.asarray(df_tiles[feature_list])
X = StandardScaler().fit_transform(features_array)
k_means = KMeans(init = "k-means++", n_clusters = no_clusters, n_init = 12)
k_means.fit(X)
labels = k_means.labels_
df_tiles['area_label'] = labels

map_paris = folium.Map(location=center_of_paris, zoom_start=13)
# Add the color for the chloropleth:

folium.Choropleth(
    geo_data=json_tiles_fname,
    name=name,
    data=df_tiles,
    columns=['coords_str', 'area_label'],
    key_on='feature.properties.coords_str',
    fill_color='Set1',
    fill_opacity=0.4, 
    line_opacity=0.0,
    legend_name='Area Label'  
).add_to(map_paris)
folium.LayerControl().add_to(map_paris)

map_paris
Out[25]:

Now we clearly see areas cathegoriezed using the venue data. As the river and outlying areas are grouped in one category, we can assume this group collects areas with a low number of venues. But, which of the areas is the best to sell Roses?

4 Results

back to top

Looking at the heat maps in section 3.3 we find that there is quite a bit of variety in the distribution of venues around Paris. This may well be expected as some areas are very touristic, featuring many places to make money as a Rose-seller, while others are more residential.

Lets plot the data per category to better understang what makes up each category as identified by k-means:

In [26]:
sns.set(style="whitegrid", palette="muted")
df_features = df_tiles[feature_list+['area_label']][:]
# print(df_features.head())
# "Melt" the dataset to "long-form" or "tidy" representation
df_features = pd.melt(df_features, id_vars='area_label')
# print(df_features.head())
# Draw a categorical scatterplot to show each observation
f, ax = plt.subplots(figsize=(12, 8))
sns.swarmplot(data=df_features, hue='area_label', y='value', x='variable',
              palette=category_color_list,size=5)
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x20918d4bb08>

We clearly see two colors on the lower end of the counts. Also some categories ('sweet', 'asian' and 'night-life') are clearly dominated by one color, while 'european and 'fast' has a mix of colours on the top of the counts pile. To go further into this we can plotting the average count per categorie and k-means label:

In [27]:
sns.set(style="whitegrid", palette="muted")
df_mean = df_features.groupby(by=['area_label','variable'],as_index=False,sort=False).mean()
# print(df_mean.head())
# Draw a categorical scatterplot to show each observation
f, axes = plt.subplots(nrows=len(feature_list), ncols=1, sharex=True,figsize=(8, 12))
for i, feature in enumerate(feature_list):
    i_data = df_mean.where(df_mean['variable']==feature) 
    sns.barplot(data=i_data, palette=category_color_list, y='value', x='area_label', ax=axes[i])
    axes[i].set_ylabel(feature)
    if i<no_clusters:
        axes[i].set_xlabel('')

As expected from the map view, there are two groups which have a very low count of venues 'all'. These will be named 'low0' and 'low1'. The other areas all have more than 20 locations on average and will be named 'A', 'B', 'C' and 'D', respectively. Three of them can be clearly identified as the groups with most counts in 'asian' -> 'A', 'sweet' -> 'B', 'night-life' -> 'C', with the last remaining group labeled 'D'

In [28]:
area_name_list = ['A', 'B', 'C', 'D']
area_name_dict ={}
i_data = df_mean.where(df_mean['variable']=='all')
i_data = i_data.dropna().sort_values('value')
area_name_dict.update(dict(zip(i_data['area_label'].values[:2],['low0','low1'])))

i_data = df_mean.where(df_mean['variable']=='asian')
i_data = i_data.dropna().sort_values('value')
area_name_dict.update({i_data['area_label'].values[-1]:'A'})

i_data = df_mean.where(df_mean['variable']=='sweet')
i_data = i_data.dropna().sort_values('value')
area_name_dict.update({i_data['area_label'].values[-1]:'B'})

i_data = df_mean.where(df_mean['variable']=='night_life')
i_data = i_data.dropna().sort_values('value')
area_name_dict.update({i_data['area_label'].values[-1]:'C'})

other_no = [x for x in np.arange(6) if x not in area_name_dict.keys()]

area_name_dict.update({float(other_no[0]):'D'})
interesting_areas = ['A','B','C','D']
print(area_name_dict)
{4.0: 'low0', 0.0: 'low1', 5.0: 'A', 2.0: 'B', 3.0: 'C', 1.0: 'D'}

From the k_means algorithm we have the information of the cluster centers in the normalized parameter space (called X). We have defined four interesting regions (A to D) which feature a high denisty of venues. Lets see which the distribution of venue types in these areas:

In [29]:
fig, axes = plt.subplots(nrows=4,ncols=1,
                         sharex=True,figsize=(8, 12))
ax_count=0

for label, area_name in area_name_dict.items():
    # not low0 or low1
    if area_name in interesting_areas:
        center = k_means.cluster_centers_[int(label)]
        data = np.stack([range(len(center)),
                        center])
        axis = axes[ax_count]
        sns.barplot(x = np.arange(len(center)), y =center,
                    palette=category_color_list,
                    ax=axis)
        axis.set_ylabel(area_name)
        if ax_count<len(interesting_areas)-1:
            axes[ax_count].set_xlabel('')
            axes[ax_count].set_xticklabels([])
        else:
            axes[ax_count].set_xlabel('feature')
            axes[ax_count].set_xticklabels(feature_list)
        ax_count += 1    
    
    

With this distribution the rose-seller can now decide which type of area is better for his business. Maybe start in the afternoon in an area of type 'B' where many restaurants are offering sweet treats and some unsuspecting tourist may be hanging out. As evening comes, go to areas like 'A' or 'D' which feature restaurants where potential customers will be having dinner. Afterwards, finish the nights work with a tour of the bars and nightclubs in areas like 'C'.

We next calculate how close in the parameter space (X) ALL areas of the map are with respect to the identified intersting areas. We will start with the euclidian distance (numpy.linalg.norm) from the center of the k_means group as a measure for similarity. To get a sensible similarity measure we add 1 to the distance, invert the measure and normalize to 1: similarity ~ 1/(dist + 1) normalized to give us a measure between 1 (similar) and 0 (dissimmilar).

In [30]:
for label, area_name in area_name_dict.items():
    if area_name in interesting_areas:
        x_space = np.copy(X)
        center = k_means.cluster_centers_[int(label)]
        x_space += -center
        distance = np.linalg.norm(x_space,axis=1)
        similarity = 1./(distance+1)
        #normalize:
        similarity = similarity - min(similarity)
        similarity = similarity / max(similarity)
        df_tiles['similarity_to_{}'.format(area_name)] = similarity
In [31]:
map_paris = folium.Map(location=center_of_paris, zoom_start=13)
# Add the color for the chloropleth:

for color, area_name in zip(['Reds', 'Blues', 'Greens', 'Greys'],['A','B','C','D']): 
    folium.Choropleth(
        geo_data=json_tiles_fname,
        name=area_name,
        data=df_tiles,
        columns=['coords_str', 'similarity_to_{}'.format(area_name)],
        key_on='feature.properties.coords_str',
        fill_color=color,
        fill_opacity=0.6, 
        line_opacity=0.0,
        legend_name='Similarity to {}'.format(area_name)
    ).add_to(map_paris)
folium.LayerControl().add_to(map_paris)

map_paris
Out[31]:

5 Discussion

back to top

There are a few assumptions and short cuts in this analysis that have to be mentioned.

First, the foursquare response actually returns venues in a circle around the given location (radius given), not in the plotted hexagon. I have tried to keep the overlap minimal without misssing to much coverage. A more accurate approach may decrease the noise in the heatmaps and segmented data. Also, applying a spatial smoothing algorithm over the ound restaurant distribution may reduce some of the noise and produce a smoother looking map. Clever smoothing and decreasing the hexagon size could be pursued in parallel, without reducing the number of counts per evaluated entry. This will increase the number of calls to foursquare even further though.

Next, both the feature list and the k-means algorithm with k=6 was somwhat arbitrarily choosen. The number of features isn't too large yet for distance-based clustering, so a finer comb might yet reveal interesting results.

As far as the areas that are interesting to sell in are concerned both points are unlikely to have an effect. The observed differences between the cathegorized areas are quite apparent. The main deciding factor will remain the density of venues overall. However, the k-means results show that there are clearly distinct mixes of venue types which we revealed and located on the map.

From the presented analysis a possible strategy for the sales operation during one night would look like this:
16h: Go to the area around Saint Michel and the Odeon (many sweet stores) to catch customers after their refreshments. (like B)
19h: Start working your way across the Seine to Les Halles and the 2nd Arrondissment targeting restaunts. (like D)
22h: Turn right to Grands Boulevards, Bonne Nouvelle, Strasbourg Saint-Dennis and Etienne Marcel to finish the night selling to customers visiting night_life venues (like C)

6 Conclusion

back to top

When looking where to sell Roses in Paris, areas with a high number of venues where potential customers would be are good areas to target. This report shows clearly that there are some well resolved areas with a high venue density.

Going further than just the density of venues, the character of different areas with high restaurant density can be identified by looking at the relative number of venues according to the categories 'european', 'fast', 'night_life', 'asian', 'other' and 'sweet'. Using these there are four main types of areas: those with a high proportion of 'asian', 'sweet' and 'night_life' venues, and one type with an even mix of venues. These different types of venues were identified and located so that the customer can approach them at optimized times and with the appropriate marketing strategy.